

Learning to Constrain Policy Optimization with Virtual Trust Region

Neural Information Processing Systems

Compared to Deep Q-learning, deep policy gradient (PG) methods are often more flexible and applicable to discrete and continuous action problems. However, these methods tend to suffer from high sample complexity and training instability since the gradient may not accurately reflect the policy gain when the policy changes substantially [6].


Learning to Constrain Policy Optimization with Virtual Trust Region

Neural Information Processing Systems

We introduce a constrained optimization method for policy gradient reinforcement learning, which uses two trust regions to regulate each policy update. In addition to using the proximity of one single old policy as the first trust region as done by prior works, we propose forming a second trust region by constructing another virtual policy that represents a wide range of past policies. We then enforce the new policy to stay closer to the virtual policy, which is beneficial if the old policy performs poorly. We propose a mechanism to automatically build the virtual policy from a memory buffer of past policies, providing a new capability for dynamically selecting appropriate trust regions during the optimization process. Our proposed method, dubbed Memory-Constrained Policy Optimization (MCPO), is examined in diverse environments, including robotic locomotion control, navigation with sparse rewards and Atari games, consistently demonstrating competitive performance against recent on-policy constrained policy gradient methods.
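The abstract above does not spell out the exact objective, so the following is only a minimal sketch of the dual-trust-region idea it describes: a PPO-style importance-weighted surrogate penalized by two KL terms, one toward the previous policy and one toward a virtual policy. The function names, the penalty coefficients `beta_old` and `beta_virtual`, and the use of KL penalties rather than hard constraints are all illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def kl(p, q):
    """KL divergence KL(p || q) between two discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

def dual_trust_region_loss(pi_new, pi_old, pi_virtual, advantages, actions,
                           beta_old=1.0, beta_virtual=1.0):
    """Hypothetical surrogate with two trust-region penalties.

    pi_new, pi_old, pi_virtual: (T, A) per-timestep action probabilities.
    The first KL term keeps the new policy near the old one (the usual
    trust region); the second keeps it near a virtual policy that stands
    in for a wide range of past policies.
    """
    T = len(actions)
    # importance-weighted policy gain, as in standard surrogate objectives
    ratio = pi_new[np.arange(T), actions] / pi_old[np.arange(T), actions]
    surrogate = np.mean(ratio * advantages)
    kl_old = np.mean([kl(pi_old[t], pi_new[t]) for t in range(T)])
    kl_virtual = np.mean([kl(pi_virtual[t], pi_new[t]) for t in range(T)])
    # maximize the gain while penalizing departure from both references
    return surrogate - beta_old * kl_old - beta_virtual * kl_virtual
```

When the old policy performs poorly, raising `beta_virtual` relative to `beta_old` pulls the update toward the virtual policy instead, which is the behavior the abstract motivates.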


Appendix

Neural Information Processing Systems

A Method Details

A.1 The attention network

The attention network is implemented as a feedforward neural network with one hidden layer. Input layer: 12 units. Hidden layer: N units coupled with a dropout layer (p = 0. ). From these three policies, we try to extract all available information. The information should be cheap to extract and dependent on the current data, so we prefer features computed from the outputs of these policies (value, entropy, distance, return, etc.). Intuitively, the most important features are the empirical returns, the values associated with each policy, and the distances, which together give a good hint of which virtual policy will yield high performance (e.g., a virtual policy that is closer to the policy that obtained a high return and a low value loss).

A.2 The advantage function

In this paper, we use GAE [18] as the advantage function Â for all models and experiments. Note that Algo. 1 illustrates the procedure for one actor.

A.3 The objective function

Following [19], our objective function also includes value loss and entropy terms.
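Since A.2 only names GAE [18] without restating it, here is the standard GAE recursion it refers to, computed for a single actor's trajectory: Â_t = Σ_l (γλ)^l δ_{t+l} with δ_t = r_t + γV(s_{t+1}) − V(s_t). The helper name and default coefficients are illustrative, not taken from the paper.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one actor's trajectory.

    rewards: (T,)   rewards r_t
    values:  (T+1,) value estimates V(s_t), including the bootstrap value
    returns: (T,)   advantages A_t = sum_l (gamma*lam)^l * delta_{t+l}
    """
    T = len(rewards)
    # one-step TD residuals delta_t = r_t + gamma*V(s_{t+1}) - V(s_t)
    deltas = rewards + gamma * values[1:] - values[:-1]
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):  # backward pass accumulates the (gamma*lam) sum
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv
```

With λ = 0 this reduces to the one-step TD residual, and with λ = 1 to the full Monte Carlo advantage, matching the usual GAE bias-variance trade-off. The complete objective in A.3 would then combine the policy loss built on these advantages with a value loss and an entropy bonus, with weighting coefficients not specified here.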






Learning to Constrain Policy Optimization with Virtual Trust Region

Le, Hung; George, Thommen Karimpanal; Abdolshah, Majid; Nguyen, Dung; Do, Kien; Gupta, Sunil; Venkatesh, Svetha

arXiv.org Artificial Intelligence

We introduce a constrained optimization method for policy gradient reinforcement learning, which uses a virtual trust region to regulate each policy update. In addition to using the proximity of one single old policy as the normal trust region, we propose forming a second trust region through another virtual policy representing a wide range of past policies. We then enforce the new policy to stay closer to the virtual policy, which is beneficial if the old policy performs poorly. More importantly, we propose a mechanism to automatically build the virtual policy from a memory of past policies, providing a new capability for dynamically learning appropriate virtual trust regions during the optimization process. Our proposed method, dubbed Memory-Constrained Policy Optimization (MCPO), is examined in diverse environments, including robotic locomotion control, navigation with sparse rewards and Atari games, consistently demonstrating competitive performance against recent on-policy constrained policy gradient methods.